This R Markdown sheet includes a quite large set of R code chunks taken from the RHDS book R for Health Data Science by Ewen Harrison and Riinu Pius (CRC Press, 2021), available online at https://argoshare.is.ed.ac.uk/healthyr_book/. The structure follows the structure of the book (chapters 2, 3, 4, 5, 6, and 8 and their sections and subsections).
The goal is to get familiar with R and R Markdown (in a quite quick pace) using typical R functions for data wrangling, visualization, and analysis. You do not need to do very much (yet), mostly just read or browse the book while following and exploring the work flow in RStudio, activating the R functions and studying the output (results, statistics, graphics etc.) that will be created (including some tiny errors that are included on purpose). There are also some small exercises, where you are encouraged to write a bit of R code yourself (based on the hints given in the book). You may well skip those exercises.
The rest of the Exercise sets on this IODS (Introduction to Open Data Science) course have a different, more interactive structure. They have their own logic and detailed instructions, as you will see from the next week onwards.
You should begin by reading (or browsing through) this chapter:
https://argoshare.is.ed.ac.uk/healthyr_book/why-we-love-r.html
Note1: On IODS course, we focus on writing the steps of the analysis in R Markdown, but we also use R scripts. Both ways of working are included in the RHDS book. Here, our focus is on the R Markdown (R scripts are a bit simpler).
Note2: We have already copy-pasted the R codes for you in this R Markdown sheet, beginning from Chapter 2. We have also included the example datasets of RHDS book through our GitHub repository, so all the datasets of the book are easily accessible on the fly, without a need to download them separately.
The following is the first example of an R code chunk, where you can activate R functions. Its contents have been copy-pasted from Section 1.7. Try now to activate the R “command” (“1001:1004”) by putting the cursor on that line below, and pressing Ctrl+Enter (on Windows) or Cmd+Enter (on Mac)!
# Now we're printing bigger numbers:
1001:1004## [1] 1001 1002 1003 1004
As you see above, the results (in this case, the numbers 1001, 1002, 1003, and 1004) appear immediately below the R code chunk (preceded by the usual [1], which you will also get used to, very soon.)
OK, as soon as you have read Chapter 1, you are ready to move on, to Chapter 2!
Again, the idea is that you read (or at least browse) the contents of the book while following the work flow (the R code chunks that have been created here for you).
So, begin reading at https://argoshare.is.ed.ac.uk/healthyr_book/r-basics.html.
As soon as you see R code in the book, look at your RStudio, in this R Markdown sheet. Activate the R functions one at a time and see what happens.
Have fun! This will get you ready for the rest of the IODS course!
NOTE: The function library(…) is used to call another R package to work. When there is a library(…) function involved (as there is in the code chunk below), you must be prepared to install the package (if you have not installed that before). You can do that in the RStudio “Packages” pane on the right, but also by using an R function install.packages(“…”) in this editor. In some cases (like just below), the install.packages(“…”) function is given as a comment to the library call, which is a useful trick: you can select (with mouse or arrow keys) the install.packages(“…”) and run it (by Ctrl+Enter / Cmd+Enter). It is NOT a good idea to leave the install.packages(“…”) function in the code, outside of the comment (similarly as the library, for example), as then the package would be installed unnecessarily often (which takes time and other resources). Usually, installing is done once, and then you can use the package (with library(…)) as many times as you need to.
So try this below: select (with mouse or arrow keys) the install.packages(“…”) and run it (by Ctrl+Enter / Cmd+Enter). It will install the tidyverse package collection that we will need all the time.
After an successful installation (that might take some time), continue by activating the actual functions, beginning from library(…) and then finally reading the data in:
library(tidyverse) # install.packages("tidyverse")## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
example_data <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/example_data.csv")## Rows: 3 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): id, group
## dbl (1): measurement
## dttm (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(example_data) # look at the data (and then close that view to return to the editor)library(tidyverse)
gbd_short <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year.csv")## Rows: 21 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): cause
## dbl (2): year, deaths_millions
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(gbd_short)Note: Some outputs may differ a bit from the book, because R is constantly developed further. One example is the following. (We have modified the code chunk a bit to reveal the outputs shown in the book.)
library(tidyverse)
typesdata <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/typesdata.csv")## Rows: 3 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): id, group
## dbl (1): measurement
## dttm (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(typesdata)## cols(
## id = col_character(),
## group = col_character(),
## measurement = col_double(),
## date = col_datetime(format = "")
## )
typesdata## # A tibble: 3 × 4
## id group measurement date
## <chr> <chr> <dbl> <dttm>
## 1 ID1 Control 1.8 2017-01-02 12:00:00
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00
typesdata_faulty <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/typesdata_faulty.csv")## Rows: 3 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): id, group, measurement, date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(typesdata_faulty)## cols(
## id = col_character(),
## group = col_character(),
## measurement = col_character(),
## date = col_character()
## )
typesdata_faulty## # A tibble: 3 × 4
## id group measurement date
## <chr> <chr> <chr> <chr>
## 1 ID1 Control 1.8 02-Jan-17 12:00
## 2 ID2 Treatment 4.5 03-Feb-18 13:00
## 3 ID3 Treatment 3.7 or 3.8 04-Mar-19 14:00
typesdata$measurement %>% mean()## [1] 3.333333
measurement_mean <- typesdata$measurement %>% mean()
measurement_mean == 3.333333## [1] FALSE
(0.10 + 0.05) == 0.15## [1] FALSE
library(tidyverse)
near(0.10+0.05, 0.15)## [1] TRUE
near(measurement_mean, 3.333333, 0.000001)## [1] TRUE
library(tidyverse)
typesdata %>%
count(group)## # A tibble: 2 × 2
## group n
## <chr> <int>
## 1 Control 1
## 2 Treatment 2
typesdata %>%
count(group, sort = TRUE)## # A tibble: 2 × 2
## group n
## <chr> <int>
## 1 Treatment 2
## 2 Control 1
# all ids are unique:
typesdata %>%
count(id, sort = TRUE)## # A tibble: 3 × 2
## id n
## <chr> <int>
## 1 ID1 1
## 2 ID2 1
## 3 ID3 1
# we add in a duplicate row where id = ID3,
# then count again:
typesdata %>%
add_row(id = "ID3") %>%
count(id, sort = TRUE)## # A tibble: 3 × 2
## id n
## <chr> <int>
## 1 ID3 2
## 2 ID1 1
## 3 ID2 1
(just read - more about this topic will follow later..)
Remember to install the lubridate package first…
library(lubridate) # install.packages("lubridate") # makes working with dates easier## Loading required package: timechange
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
current_datetime <- Sys.time()
current_datetime## [1] "2023-11-02 13:44:14 EET"
my_datetime <- "2020-12-01 12:00"
my_datetime## [1] "2020-12-01 12:00"
# error that is explained in the book... (you may put '#' in front of it)
# my_datetime - current_datetime
current_datetime %>% class()## [1] "POSIXct" "POSIXt"
my_datetime %>% class()## [1] "character"
2.2.4 continued…
my_datetime_converted <- ymd_hm(my_datetime)
my_datetime_converted## [1] "2020-12-01 12:00:00 UTC"
# now it will work:
my_datetime_converted - current_datetime## Time difference of -1065.989 days
my_datesdiff <- my_datetime_converted - current_datetime
my_datesdiff %>% class()## [1] "difftime"
ymd_hm("2021-01-02 12:00") + my_datesdiff## [1] "2018-02-01 12:15:45 UTC"
# another error challenge here... (again you may put '#' in front of that afterwards)
# 560/my_datesdiff
560/as.numeric(my_datesdiff)## [1] -0.5253337
2.2.4 continued…
parse_date_time("12:34 07/Jan'20", "%H:%M %d/%b'%y")## [1] "2020-01-07 12:34:00 UTC"
Sys.time()## [1] "2023-11-02 13:44:14 EET"
Sys.time() %>% format("%H:%M on %B-%d (%Y)")## [1] "13:44 on Marraskuu-02 (2023)"
Sys.time() %>% format("Happy days, the current time is %H:%M %B-%d (%Y)!")## [1] "Happy days, the current time is 13:44 Marraskuu-02 (2023)!"
library(tidyverse)
mydata <- tibble(
id = 1:4,
sex = c("Male", "Female", "Female", "Male"),
var1 = c(4, 1, 2, 3),
var2 = c(NA, 4, 5, NA),
var3 = c(2, 1, NA, NA)
)mydata## # A tibble: 4 × 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 4 NA 2
## 2 2 Female 1 4 1
## 3 3 Female 2 5 NA
## 4 4 Male 3 NA NA
mydata$var1## [1] 4 1 2 3
mean(mydata$var1)## [1] 2.5
mean(mydata$var2)## [1] NA
mean(mydata$sex)## Warning in mean.default(mydata$sex): argument is not numeric or logical:
## returning NA
## [1] NA
mean(mydata$var2, na.rm = TRUE)## [1] 4.5
Sys.time()## [1] "2023-11-02 13:44:14 EET"
a <- 103
a## [1] 103
seq(15, 30)## [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
example_sequence <- seq(15, 30)
example_sequence <- example_sequence/2
example_sequence## [1] 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5
## [16] 15.0
mean_result <- mean(mydata$var2, na.rm = TRUE)library(tidyverse)
mydata$var1 %>% mean()## [1] 2.5
mean_result <- mydata$var1 %>% mean()mydata %>%
lm(var1~var2, data = .)##
## Call:
## lm(formula = var1 ~ var2, data = .)
##
## Coefficients:
## (Intercept) var2
## -3 1
gbd_short %>%
filter(year < 1995)## # A tibble: 3 × 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
## 2 1990 Injuries 4.25
## 3 1990 Non-communicable diseases 26.7
gbd_short %>%
filter(year <= 1995)## # A tibble: 6 × 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
## 2 1990 Injuries 4.25
## 3 1990 Non-communicable diseases 26.7
## 4 1995 Communicable diseases 15.1
## 5 1995 Injuries 4.53
## 6 1995 Non-communicable diseases 29.3
gbd_short %>%
filter(year == 1995)## # A tibble: 3 × 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1995 Communicable diseases 15.1
## 2 1995 Injuries 4.53
## 3 1995 Non-communicable diseases 29.3
2.5 continued…
# do you see what is wrong here? (you may fix it or hide it with '#' afterwards)
# gbd_short %>%
# filter(year = 1995)
gbd_short %>%
filter(year == 1995 | year == 2017)## # A tibble: 6 × 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1995 Communicable diseases 15.1
## 2 1995 Injuries 4.53
## 3 1995 Non-communicable diseases 29.3
## 4 2017 Communicable diseases 10.4
## 5 2017 Injuries 4.47
## 6 2017 Non-communicable diseases 40.9
gbd_short %>%
filter(year == max(year) | year == min(year))## # A tibble: 6 × 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
## 2 1990 Injuries 4.25
## 3 1990 Non-communicable diseases 26.7
## 4 2017 Communicable diseases 10.4
## 5 2017 Injuries 4.47
## 6 2017 Non-communicable diseases 40.9
mydata_year2000 <- gbd_short %>%
filter(year == 2000)
mydata_year2000## # A tibble: 3 × 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 2000 Communicable diseases 14.8
## 2 2000 Injuries 4.56
## 3 2000 Non-communicable diseases 31.0
2.5.1 continued…
new_data_selection <- gbd_short %>%
filter((year == 1990 | year == 2013) & cause == "Communicable diseases")
new_data_selection## # A tibble: 1 × 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
2.5.1 continued…
# Or we can get rid of the extra brackets around the years
# by moving cause into a new filter on a new line:
new_data_selection <- gbd_short %>%
filter(year == 1990 | year == 2013) %>%
filter(cause == "Communicable diseases")
new_data_selection## # A tibble: 1 × 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
2.5.1 continued…
# Or even better, we can include both in one filter() call, as all
# separate conditions are by default joined with "&":
new_data_selection <- gbd_short %>%
filter(year == 1990 | year == 2013,
cause == "Communicable diseases")
new_data_selection## # A tibble: 1 × 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
gbd_short$cause %>% unique()## [1] "Communicable diseases" "Injuries"
## [3] "Non-communicable diseases"
gbd_short %>%
# also filtering for a single year to keep the result concise
filter(year == 1990) %>%
filter(cause == "Communicable diseases" | cause == "Non-communicable diseases")## # A tibble: 2 × 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
## 2 1990 Non-communicable diseases 26.7
gbd_short %>%
filter(year == 1990) %>%
filter(cause %in% c("Communicable diseases", "Non-communicable diseases"))## # A tibble: 2 × 3
## year cause deaths_millions
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
## 2 1990 Non-communicable diseases 26.7
mydata## # A tibble: 4 × 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 4 NA 2
## 2 2 Female 1 4 1
## 3 3 Female 2 5 NA
## 4 4 Male 3 NA NA
mydata %>%
filter(is.na(var2))## # A tibble: 2 × 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 4 NA 2
## 2 4 Male 3 NA NA
mydata %>%
filter(!is.na(var2))## # A tibble: 2 × 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2 Female 1 4 1
## 2 3 Female 2 5 NA
mydata %>%
filter(var2 != 5)## # A tibble: 1 × 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2 Female 1 4 1
mydata %>%
filter(var2 != 5 | is.na(var2))## # A tibble: 3 × 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 4 NA 2
## 2 2 Female 1 4 1
## 3 4 Male 3 NA NA
2.7 continued…
subset1 <- mydata %>%
filter(var2 == 5)
subset2 <- mydata %>%
filter(! var2 == 5)
subset1; subset2## # A tibble: 1 × 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 3 Female 2 5 NA
## # A tibble: 1 × 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2 Female 1 4 1
nrow(mydata)## [1] 4
nrow(subset1)## [1] 1
nrow(subset2)## [1] 1
nrow(subset1) + nrow(subset2) == nrow(mydata)## [1] FALSE
subset3 <- mydata %>%
filter(is.na(var2))
nrow(subset1) + nrow(subset2) + nrow(subset3) == nrow(mydata)## [1] TRUE
typesdata## # A tibble: 3 × 4
## id group measurement date
## <chr> <chr> <dbl> <dttm>
## 1 ID1 Control 1.8 2017-01-02 12:00:00
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00
typesdata$measurement## [1] 1.8 4.5 3.7
typesdata$measurement/2## [1] 0.90 2.25 1.85
typesdata %>%
mutate(measurement/2)## # A tibble: 3 × 5
## id group measurement date `measurement/2`
## <chr> <chr> <dbl> <dttm> <dbl>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 0.9
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 2.25
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 1.85
typesdata %>%
mutate(measurement_half = measurement/2)## # A tibble: 3 × 5
## id group measurement date measurement_half
## <chr> <chr> <dbl> <dttm> <dbl>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 0.9
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 2.25
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 1.85
2.8 continued…
mydata$`Nasty column name`## Warning: Unknown or uninitialised column: `Nasty column name`.
## NULL
# or
# mydata %>%
# select(`Nasty column name`)2.8 continued…
typesdata_modified <- typesdata %>%
mutate(measurement_half = measurement/2)
typesdata_modified## # A tibble: 3 × 5
## id group measurement date measurement_half
## <chr> <chr> <dbl> <dttm> <dbl>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 0.9
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 2.25
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 1.85
2.8 continued…
library(lubridate)
typesdata %>%
mutate(reference_date = ymd_hm("2020-01-01 12:00"),
dates_difference = reference_date - date) %>%
select(date, reference_date, dates_difference)## # A tibble: 3 × 3
## date reference_date dates_difference
## <dttm> <dttm> <drtn>
## 1 2017-01-02 12:00:00 2020-01-01 12:00:00 1094.0000 days
## 2 2018-02-03 13:00:00 2020-01-01 12:00:00 696.9583 days
## 3 2019-03-04 14:00:00 2020-01-01 12:00:00 302.9167 days
typesdata %>%
mutate(mean_measurement = mean(measurement))## # A tibble: 3 × 5
## id group measurement date mean_measurement
## <chr> <chr> <dbl> <dttm> <dbl>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 3.33
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 3.33
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 3.33
typesdata %>%
mutate(mean_measurement = mean(measurement)) %>%
mutate(measurement_relative = measurement/mean_measurement) %>%
select(matches("measurement"))## # A tibble: 3 × 3
## measurement mean_measurement measurement_relative
## <dbl> <dbl> <dbl>
## 1 1.8 3.33 0.54
## 2 4.5 3.33 1.35
## 3 3.7 3.33 1.11
typesdata %>%
mutate(reference_date = ymd_hm("2020-01-01 12:00"),
dates_difference = reference_date - date) %>%
mutate(dates_difference = round(dates_difference)) %>%
select(matches("date"))## # A tibble: 3 × 3
## date reference_date dates_difference
## <dttm> <dttm> <drtn>
## 1 2017-01-02 12:00:00 2020-01-01 12:00:00 1094 days
## 2 2018-02-03 13:00:00 2020-01-01 12:00:00 697 days
## 3 2019-03-04 14:00:00 2020-01-01 12:00:00 303 days
typesdata %>%
mutate(above_threshold = if_else(measurement > 3,
"Above three",
"Below three"))## # A tibble: 3 × 5
## id group measurement date above_threshold
## <chr> <chr> <dbl> <dttm> <chr>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 Below three
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 Above three
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 Above three
typesdata %>%
mutate(plot_label = paste(id,
"was last measured at", date,
", and the value was", measurement)) %>%
select(plot_label)## # A tibble: 3 × 1
## plot_label
## <chr>
## 1 ID1 was last measured at 2017-01-02 12:00:00 , and the value was 1.8
## 2 ID2 was last measured at 2018-02-03 13:00:00 , and the value was 4.5
## 3 ID3 was last measured at 2019-03-04 14:00:00 , and the value was 3.7
pastedata <- tibble(year = c(2007, 2008, 2009),
month = c("Jan", "Feb", "March"),
day = c(1, 2, 3))
pastedata## # A tibble: 3 × 3
## year month day
## <dbl> <chr> <dbl>
## 1 2007 Jan 1
## 2 2008 Feb 2
## 3 2009 March 3
2.10 continued…
pastedata %>%
mutate(date = paste(day, month, year, sep = "-"))## # A tibble: 3 × 4
## year month day date
## <dbl> <chr> <dbl> <chr>
## 1 2007 Jan 1 1-Jan-2007
## 2 2008 Feb 2 2-Feb-2008
## 3 2009 March 3 3-March-2009
library(lubridate)
pastedata %>%
mutate(date = paste(day, month, year, sep = "-")) %>%
mutate(date = dmy(date))## # A tibble: 3 × 4
## year month day date
## <dbl> <chr> <dbl> <date>
## 1 2007 Jan 1 2007-01-01
## 2 2008 Feb 2 2008-02-02
## 3 2009 March 3 2009-03-03
library(tidyverse)
patientdata <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/patient_data.csv")## Rows: 6 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sex
## dbl (2): id, age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
patientdata## # A tibble: 6 × 3
## id sex age
## <dbl> <chr> <dbl>
## 1 1 Female 24
## 2 2 Male 59
## 3 3 Female 32
## 4 4 Female 84
## 5 5 Male 48
## 6 6 Female 65
labsdata <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/labs_data.csv")## Rows: 4 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): id, measurement
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
labsdata## # A tibble: 4 × 2
## id measurement
## <dbl> <dbl>
## 1 5 3.47
## 2 6 7.31
## 3 8 9.91
## 4 7 6.11
full_join(patientdata, labsdata)## Joining, by = "id"
## # A tibble: 8 × 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 1 Female 24 NA
## 2 2 Male 59 NA
## 3 3 Female 32 NA
## 4 4 Female 84 NA
## 5 5 Male 48 3.47
## 6 6 Female 65 7.31
## 7 8 <NA> NA 9.91
## 8 7 <NA> NA 6.11
inner_join(patientdata, labsdata)## Joining, by = "id"
## # A tibble: 2 × 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 5 Male 48 3.47
## 2 6 Female 65 7.31
left_join(patientdata, labsdata)## Joining, by = "id"
## # A tibble: 6 × 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 1 Female 24 NA
## 2 2 Male 59 NA
## 3 3 Female 32 NA
## 4 4 Female 84 NA
## 5 5 Male 48 3.47
## 6 6 Female 65 7.31
right_join(patientdata, labsdata)## Joining, by = "id"
## # A tibble: 4 × 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 5 Male 48 3.47
## 2 6 Female 65 7.31
## 3 8 <NA> NA 9.91
## 4 7 <NA> NA 6.11
patientdata_new <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/patient_data_updated.csv")## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sex
## dbl (2): id, age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
patientdata_new## # A tibble: 2 × 3
## id sex age
## <dbl> <chr> <dbl>
## 1 7 Female 38
## 2 8 Male 29
bind_rows(patientdata, patientdata_new)## # A tibble: 8 × 3
## id sex age
## <dbl> <chr> <dbl>
## 1 1 Female 24
## 2 2 Male 59
## 3 3 Female 32
## 4 4 Female 84
## 5 5 Male 48
## 6 6 Female 65
## 7 7 Female 38
## 8 8 Male 29
labsdata_updated <- labsdata %>%
add_row(id = 5, measurement = 2.49)
labsdata_updated## # A tibble: 5 × 2
## id measurement
## <dbl> <dbl>
## 1 5 3.47
## 2 6 7.31
## 3 8 9.91
## 4 7 6.11
## 5 5 2.49
left_join(patientdata, labsdata_updated)## Joining, by = "id"
## # A tibble: 7 × 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 1 Female 24 NA
## 2 2 Male 59 NA
## 3 3 Female 32 NA
## 4 4 Female 84 NA
## 5 5 Male 48 3.47
## 6 5 Male 48 2.49
## 7 6 Female 65 7.31
Well done! That was active reading of Chapter 2.
Working will continue below with Chapter 3…
Continue reading at https://argoshare.is.ed.ac.uk/healthyr_book/summarising-data.html and working with the R code chunks.
library(tidyverse)
gbd_full <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year-sex-income.csv")## Rows: 168 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): cause, sex, income
## dbl (2): year, deaths_millions
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Creating a single-year tibble for printing and simple examples:
gbd2017 <- gbd_full %>%
filter(year == 2017)
View(gbd2017)gbd2017 %>%
# without the mutate(... = fct_relevel())
# the panels get ordered alphabetically
mutate(income = fct_relevel(income,
"Low",
"Lower-Middle",
"Upper-Middle",
"High")) %>%
# defining the variables using ggplot(aes(...)):
ggplot(aes(x = sex, y = deaths_millions, fill = cause)) +
# type of geom to be used: column (that's a type of barplot):
geom_col(position = "dodge") +
# facets for the income groups:
facet_wrap(~income, ncol = 4) +
# move the legend to the top of the plot (default is "right"):
theme(legend.position = "top")gbd2017$deaths_millions %>% sum()## [1] 55.74
gbd2017 %>%
summarise(sum(deaths_millions))## # A tibble: 1 × 1
## `sum(deaths_millions)`
## <dbl>
## 1 55.7
gbd2017 %>%
group_by(cause) %>%
summarise(sum(deaths_millions))## # A tibble: 3 × 2
## cause `sum(deaths_millions)`
## <chr> <dbl>
## 1 Communicable diseases 10.4
## 2 Injuries 4.47
## 3 Non-communicable diseases 40.9
gbd2017 %>%
group_by(cause, sex) %>%
summarise(sum(deaths_millions))## `summarise()` has grouped output by 'cause'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 3
## # Groups: cause [3]
## cause sex `sum(deaths_millions)`
## <chr> <chr> <dbl>
## 1 Communicable diseases Female 4.91
## 2 Communicable diseases Male 5.47
## 3 Injuries Female 1.42
## 4 Injuries Male 3.05
## 5 Non-communicable diseases Female 19.2
## 6 Non-communicable diseases Male 21.7
gbd2017 %>%
group_by(cause, sex) %>%
summarise(deaths_per_group = sum(deaths_millions)) %>%
ungroup() %>%
mutate(deaths_total = sum(deaths_per_group))## `summarise()` has grouped output by 'cause'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## cause sex deaths_per_group deaths_total
## <chr> <chr> <dbl> <dbl>
## 1 Communicable diseases Female 4.91 55.7
## 2 Communicable diseases Male 5.47 55.7
## 3 Injuries Female 1.42 55.7
## 4 Injuries Male 3.05 55.7
## 5 Non-communicable diseases Female 19.2 55.7
## 6 Non-communicable diseases Male 21.7 55.7
# percent() function for formatting percentages come from library(scales)
library(scales)##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
gbd2017_summarised <- gbd2017 %>%
group_by(cause, sex) %>%
summarise(deaths_per_group = sum(deaths_millions)) %>%
ungroup() %>%
mutate(deaths_total = sum(deaths_per_group),
deaths_relative = percent(deaths_per_group/deaths_total))## `summarise()` has grouped output by 'cause'. You can override using the
## `.groups` argument.
gbd2017_summarised## # A tibble: 6 × 5
## cause sex deaths_per_group deaths_total deaths_relative
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Communicable diseases Female 4.91 55.7 8.8%
## 2 Communicable diseases Male 5.47 55.7 9.8%
## 3 Injuries Female 1.42 55.7 2.5%
## 4 Injuries Male 3.05 55.7 5.5%
## 5 Non-communicable diseases Female 19.2 55.7 34.4%
## 6 Non-communicable diseases Male 21.7 55.7 39.0%
# using values from the first row as an example:
round(100*4.91/55.74, 1) %>% paste0("%")## [1] "8.8%"
gbd2017_summarised %>%
mutate(deaths_relative = deaths_per_group/deaths_total)## # A tibble: 6 × 5
## cause sex deaths_per_group deaths_total deaths_relative
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Communicable diseases Female 4.91 55.7 0.0881
## 2 Communicable diseases Male 5.47 55.7 0.0981
## 3 Injuries Female 1.42 55.7 0.0255
## 4 Injuries Male 3.05 55.7 0.0547
## 5 Non-communicable diseases Female 19.2 55.7 0.344
## 6 Non-communicable diseases Male 21.7 55.7 0.390
gbd_summarised <- gbd2017 %>%
group_by(cause, sex) %>%
summarise(deaths_per_group = sum(deaths_millions)) %>%
arrange(sex)## `summarise()` has grouped output by 'cause'. You can override using the
## `.groups` argument.
gbd_summarised## # A tibble: 6 × 3
## # Groups: cause [3]
## cause sex deaths_per_group
## <chr> <chr> <dbl>
## 1 Communicable diseases Female 4.91
## 2 Injuries Female 1.42
## 3 Non-communicable diseases Female 19.2
## 4 Communicable diseases Male 5.47
## 5 Injuries Male 3.05
## 6 Non-communicable diseases Male 21.7
gbd_summarised_sex <- gbd_summarised %>%
group_by(sex) %>%
summarise(deaths_per_sex = sum(deaths_per_group))
gbd_summarised_sex## # A tibble: 2 × 2
## sex deaths_per_sex
## <chr> <dbl>
## 1 Female 25.5
## 2 Male 30.3
3.5 continued…
full_join(gbd_summarised, gbd_summarised_sex)## Joining, by = "sex"
## # A tibble: 6 × 4
## # Groups: cause [3]
## cause sex deaths_per_group deaths_per_sex
## <chr> <chr> <dbl> <dbl>
## 1 Communicable diseases Female 4.91 25.5
## 2 Injuries Female 1.42 25.5
## 3 Non-communicable diseases Female 19.2 25.5
## 4 Communicable diseases Male 5.47 30.3
## 5 Injuries Male 3.05 30.3
## 6 Non-communicable diseases Male 21.7 30.3
gbd_summarised %>%
group_by(sex) %>%
mutate(deaths_per_sex = sum(deaths_per_group))## # A tibble: 6 × 4
## # Groups: sex [2]
## cause sex deaths_per_group deaths_per_sex
## <chr> <chr> <dbl> <dbl>
## 1 Communicable diseases Female 4.91 25.5
## 2 Injuries Female 1.42 25.5
## 3 Non-communicable diseases Female 19.2 25.5
## 4 Communicable diseases Male 5.47 30.3
## 5 Injuries Male 3.05 30.3
## 6 Non-communicable diseases Male 21.7 30.3
gbd2017 %>%
group_by(cause, sex) %>%
summarise(deaths_per_group = sum(deaths_millions)) %>%
group_by(sex) %>%
mutate(deaths_per_sex = sum(deaths_per_group),
sex_cause_perc = percent(deaths_per_group/deaths_per_sex)) %>%
arrange(sex, deaths_per_group)## `summarise()` has grouped output by 'cause'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
## # Groups: sex [2]
## cause sex deaths_per_group deaths_per_sex sex_cause_p…¹
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Injuries Female 1.42 25.5 6%
## 2 Communicable diseases Female 4.91 25.5 19%
## 3 Non-communicable diseases Female 19.2 25.5 75%
## 4 Injuries Male 3.05 30.3 10.1%
## 5 Communicable diseases Male 5.47 30.3 18.1%
## 6 Non-communicable diseases Male 21.7 30.3 71.8%
## # … with abbreviated variable name ¹sex_cause_perc
mynumbers <- c(1, 2, NA)
sum(mynumbers)## [1] NA
sum(mynumbers, na.rm = TRUE)## [1] 3
gbd_2rows <- gbd_full %>%
slice(1:2)
gbd_2rows## # A tibble: 2 × 5
## cause year sex income deaths_millions
## <chr> <dbl> <chr> <chr> <dbl>
## 1 Communicable diseases 1990 Female High 0.21
## 2 Communicable diseases 1990 Female Upper-Middle 1.15
gbd_2rows %>%
select(cause, deaths_millions)## # A tibble: 2 × 2
## cause deaths_millions
## <chr> <dbl>
## 1 Communicable diseases 0.21
## 2 Communicable diseases 1.15
gbd_2rows %>%
select(cause, deaths = deaths_millions)## # A tibble: 2 × 2
## cause deaths
## <chr> <dbl>
## 1 Communicable diseases 0.21
## 2 Communicable diseases 1.15
gbd_2rows %>%
rename(deaths = deaths_millions)## # A tibble: 2 × 5
## cause year sex income deaths
## <chr> <dbl> <chr> <chr> <dbl>
## 1 Communicable diseases 1990 Female High 0.21
## 2 Communicable diseases 1990 Female Upper-Middle 1.15
gbd_2rows %>%
select(year, sex, income, cause, deaths_millions)## # A tibble: 2 × 5
## year sex income cause deaths_millions
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1990 Female High Communicable diseases 0.21
## 2 1990 Female Upper-Middle Communicable diseases 1.15
gbd_2rows %>%
select(year, sex, everything())## # A tibble: 2 × 5
## year sex cause income deaths_millions
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1990 Female Communicable diseases High 0.21
## 2 1990 Female Communicable diseases Upper-Middle 1.15
gbd_2rows %>%
select(starts_with("deaths"))## # A tibble: 2 × 1
## deaths_millions
## <dbl>
## 1 0.21
## 2 1.15
gbd_wide <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_wide-format.csv")## Rows: 3 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): cause
## dbl (4): Female_1990, Female_2017, Male_1990, Male_2017
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gbd_long <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year-sex.csv")## Rows: 12 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): cause, sex
## dbl (2): year, deaths_millions
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gbd_wide## # A tibble: 3 × 5
## cause Female_1990 Female_2017 Male_1990 Male_2017
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Communicable diseases 7.3 4.91 8.06 5.47
## 2 Injuries 1.41 1.42 2.84 3.05
## 3 Non-communicable diseases 12.8 19.2 13.9 21.7
gbd_long## # A tibble: 12 × 4
## cause year sex deaths_millions
## <chr> <dbl> <chr> <dbl>
## 1 Communicable diseases 1990 Female 7.3
## 2 Communicable diseases 2017 Female 4.91
## 3 Communicable diseases 1990 Male 8.06
## 4 Communicable diseases 2017 Male 5.47
## 5 Injuries 1990 Female 1.41
## 6 Injuries 2017 Female 1.42
## 7 Injuries 1990 Male 2.84
## 8 Injuries 2017 Male 3.05
## 9 Non-communicable diseases 1990 Female 12.8
## 10 Non-communicable diseases 2017 Female 19.2
## 11 Non-communicable diseases 1990 Male 13.9
## 12 Non-communicable diseases 2017 Male 21.7
gbd_long %>%
pivot_wider(names_from = year, values_from = deaths_millions)## # A tibble: 6 × 4
## cause sex `1990` `2017`
## <chr> <chr> <dbl> <dbl>
## 1 Communicable diseases Female 7.3 4.91
## 2 Communicable diseases Male 8.06 5.47
## 3 Injuries Female 1.41 1.42
## 4 Injuries Male 2.84 3.05
## 5 Non-communicable diseases Female 12.8 19.2
## 6 Non-communicable diseases Male 13.9 21.7
gbd_long %>%
pivot_wider(names_from = sex, values_from = deaths_millions) %>%
mutate(Male - Female)## # A tibble: 6 × 5
## cause year Female Male `Male - Female`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Communicable diseases 1990 7.3 8.06 0.760
## 2 Communicable diseases 2017 4.91 5.47 0.560
## 3 Injuries 1990 1.41 2.84 1.43
## 4 Injuries 2017 1.42 3.05 1.63
## 5 Non-communicable diseases 1990 12.8 13.9 1.11
## 6 Non-communicable diseases 2017 19.2 21.7 2.59
gbd_long %>%
pivot_wider(names_from = c(sex, year), values_from = deaths_millions)## # A tibble: 3 × 5
## cause Female_1990 Female_2017 Male_1990 Male_2017
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Communicable diseases 7.3 4.91 8.06 5.47
## 2 Injuries 1.41 1.42 2.84 3.05
## 3 Non-communicable diseases 12.8 19.2 13.9 21.7
gbd_wide %>%
pivot_longer(matches("Female|Male"),
names_to = "sex_year",
values_to = "deaths_millions") %>%
slice(1:6)## # A tibble: 6 × 3
## cause sex_year deaths_millions
## <chr> <chr> <dbl>
## 1 Communicable diseases Female_1990 7.3
## 2 Communicable diseases Female_2017 4.91
## 3 Communicable diseases Male_1990 8.06
## 4 Communicable diseases Male_2017 5.47
## 5 Injuries Female_1990 1.41
## 6 Injuries Female_2017 1.42
gbd_wide %>%
select(matches("Female|Male"))## # A tibble: 3 × 4
## Female_1990 Female_2017 Male_1990 Male_2017
## <dbl> <dbl> <dbl> <dbl>
## 1 7.3 4.91 8.06 5.47
## 2 1.41 1.42 2.84 3.05
## 3 12.8 19.2 13.9 21.7
gbd_wide %>%
# same pivot_longer as before
pivot_longer(matches("Female|Male"),
names_to = "sex_year",
values_to = "deaths_millions") %>%
separate(sex_year, into = c("sex", "year"), sep = "_", convert = TRUE)## # A tibble: 12 × 4
## cause sex year deaths_millions
## <chr> <chr> <int> <dbl>
## 1 Communicable diseases Female 1990 7.3
## 2 Communicable diseases Female 2017 4.91
## 3 Communicable diseases Male 1990 8.06
## 4 Communicable diseases Male 2017 5.47
## 5 Injuries Female 1990 1.41
## 6 Injuries Female 2017 1.42
## 7 Injuries Male 1990 2.84
## 8 Injuries Male 2017 3.05
## 9 Non-communicable diseases Female 1990 12.8
## 10 Non-communicable diseases Female 2017 19.2
## 11 Non-communicable diseases Male 1990 13.9
## 12 Non-communicable diseases Male 2017 21.7
gbd_long %>%
arrange(deaths_millions) %>%
# first 3 rows just for printing:
slice(1:3)## # A tibble: 3 × 4
## cause year sex deaths_millions
## <chr> <dbl> <chr> <dbl>
## 1 Injuries 1990 Female 1.41
## 2 Injuries 2017 Female 1.42
## 3 Injuries 1990 Male 2.84
gbd_long %>%
arrange(-deaths_millions) %>%
slice(1:3)## # A tibble: 3 × 4
## cause year sex deaths_millions
## <chr> <dbl> <chr> <dbl>
## 1 Non-communicable diseases 2017 Male 21.7
## 2 Non-communicable diseases 2017 Female 19.2
## 3 Non-communicable diseases 1990 Male 13.9
gbd_long %>%
arrange(desc(sex)) %>%
# printing rows 1, 2, 11, and 12
slice(1,2, 11, 12)## # A tibble: 4 × 4
## cause year sex deaths_millions
## <chr> <dbl> <chr> <dbl>
## 1 Communicable diseases 1990 Male 8.06
## 2 Communicable diseases 2017 Male 5.47
## 3 Non-communicable diseases 1990 Female 12.8
## 4 Non-communicable diseases 2017 Female 19.2
gbd_factored <- gbd_long %>%
mutate(cause = factor(cause))
gbd_factored$cause %>% levels()## [1] "Communicable diseases" "Injuries"
## [3] "Non-communicable diseases"
gbd_factored <- gbd_factored %>%
mutate(cause = cause %>%
fct_relevel("Injuries"))
gbd_factored$cause %>% levels()## [1] "Injuries" "Communicable diseases"
## [3] "Non-communicable diseases"
Look at Table 3.4 on page https://argoshare.is.ed.ac.uk/healthyr_book/exercises.html
gbd_long <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year-sex.csv")## Rows: 12 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): cause, sex
## dbl (2): year, deaths_millions
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Solution:
gbd_long %>%
pivot_wider(names_from = cause, values_from = deaths_millions)## # A tibble: 4 × 5
## year sex `Communicable diseases` Injuries `Non-communicable diseases`
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1990 Female 7.3 1.41 12.8
## 2 2017 Female 4.91 1.42 19.2
## 3 1990 Male 8.06 2.84 13.9
## 4 2017 Male 5.47 3.05 21.7
gbd_full <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year-sex-income.csv")## Rows: 168 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): cause, sex, income
## dbl (2): year, deaths_millions
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(gbd_full)## Rows: 168
## Columns: 5
## $ cause <chr> "Communicable diseases", "Communicable diseases", "Com…
## $ year <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, …
## $ sex <chr> "Female", "Female", "Female", "Female", "Male", "Male"…
## $ income <chr> "High", "Upper-Middle", "Lower-Middle", "Low", "High",…
## $ deaths_millions <dbl> 0.21, 1.15, 4.43, 1.51, 0.26, 1.35, 4.73, 1.72, 0.20, …
summary_data1 <-
gbd_full %>%
group_by(year) %>%
summarise(total_per_year = sum(deaths_millions))
summary_data1## # A tibble: 7 × 2
## year total_per_year
## <dbl> <dbl>
## 1 1990 46.3
## 2 1995 48.9
## 3 2000 50.4
## 4 2005 51.3
## 5 2010 52.6
## 6 2015 54.6
## 7 2017 55.7
summary_data2 <-
gbd_full %>%
group_by(year, cause) %>%
summarise(total_per_cause = sum(deaths_millions))## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
summary_data2## # A tibble: 21 × 3
## # Groups: year [7]
## year cause total_per_cause
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.4
## 2 1990 Injuries 4.25
## 3 1990 Non-communicable diseases 26.7
## 4 1995 Communicable diseases 15.1
## 5 1995 Injuries 4.53
## 6 1995 Non-communicable diseases 29.3
## 7 2000 Communicable diseases 14.8
## 8 2000 Injuries 4.56
## 9 2000 Non-communicable diseases 31.0
## 10 2005 Communicable diseases 13.9
## # … with 11 more rows
# Solution:
library(scales)
full_join(summary_data1, summary_data2) %>%
mutate(percentage = percent(total_per_cause/total_per_year)) ## Joining, by = "year"
## # A tibble: 21 × 5
## year total_per_year cause total_per_cause percentage
## <dbl> <dbl> <chr> <dbl> <chr>
## 1 1990 46.3 Communicable diseases 15.4 33.161%
## 2 1990 46.3 Injuries 4.25 9.175%
## 3 1990 46.3 Non-communicable diseases 26.7 57.664%
## 4 1995 48.9 Communicable diseases 15.1 30.893%
## 5 1995 48.9 Injuries 4.53 9.262%
## 6 1995 48.9 Non-communicable diseases 29.3 59.845%
## 7 2000 50.4 Communicable diseases 14.8 29.397%
## 8 2000 50.4 Injuries 4.56 9.051%
## 9 2000 50.4 Non-communicable diseases 31.0 61.552%
## 10 2005 51.3 Communicable diseases 13.9 27.102%
## # … with 11 more rows
# Solution:
gbd_full %>%
# aggregate to deaths per cause per year using summarise()
group_by(year, cause) %>%
summarise(total_per_cause = sum(deaths_millions)) %>%
# then add a column of yearly totals using mutate()
group_by(year) %>%
mutate(total_per_year = sum(total_per_cause)) %>%
# add the percentage column
mutate(percentage = percent(total_per_cause/total_per_year)) %>%
# select the final variables for better viewing
select(year, cause, percentage) %>%
pivot_wider(names_from = cause, values_from = percentage)## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 7 × 4
## # Groups: year [7]
## year `Communicable diseases` Injuries `Non-communicable diseases`
## <dbl> <chr> <chr> <chr>
## 1 1990 33% 9% 58%
## 2 1995 31% 9% 60%
## 3 2000 29% 9% 62%
## 4 2005 27% 9% 64%
## 5 2010 24% 9% 67%
## 6 2015 20% 8% 72%
## 7 2017 19% 8% 73%
# Solution:
gbd_full %>%
filter(year == 1990) %>%
group_by(income, sex) %>%
summarise(total_deaths = sum(deaths_millions)) %>%
pivot_wider(names_from = income, values_from = total_deaths)## `summarise()` has grouped output by 'income'. You can override using the
## `.groups` argument.
## # A tibble: 2 × 5
## sex High Low `Lower-Middle` `Upper-Middle`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female 4.14 2.22 8.47 6.68
## 2 Male 4.46 2.57 9.83 7.95
Wow! That was all for Chapter 3. GOOD JOB.
The next chapter will take us in plotting awesome graphs. Let’s proceed!
Look at Figure 4.1 at https://argoshare.is.ed.ac.uk/healthyr_book/chap04-h1.html.
You will now re-create the figure step by step!
library(tidyverse)
library(gapminder) # install.packages("gapminder")
glimpse(gapminder)## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
gapminder$year %>% unique()## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
gapminder$country %>% n_distinct()## [1] 142
gapminder$continent %>% unique()## [1] Asia Europe Africa Americas Oceania
## Levels: Africa Americas Asia Europe Oceania
gapdata2007 <- gapminder %>%
filter(year == 2007)
gapdata2007## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
## 7 Austria Europe 2007 79.8 8199783 36126.
## 8 Bahrain Asia 2007 75.6 708573 29796.
## 9 Bangladesh Asia 2007 64.1 150448339 1391.
## 10 Belgium Europe 2007 79.4 10392226 33693.
## # … with 132 more rows
# loads the gapminder dataset from the package environment
# into your Global Environment
gapdata <- gapminder# recommended form:
gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp))# NOT recommended form:
ggplot(gapdata2007, aes(x = gdpPercap, y = lifeExp))# just a schematic example of using the pipe:
#
# data %>%
# filter(...) %>%
# mutate(...) %>%
# ggplot(aes(...)) +
# ...4.2 continued…
gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point()gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_point()gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
geom_point()gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
geom_point(shape = 1)4.2 continued…
gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
geom_point(shape = 1) +
facet_wrap(~continent)gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
geom_point(shape = 1) +
facet_wrap(~pop > 50000000)gapdata2007 %>%
ggplot(aes(x = gdpPercap/1000, y = lifeExp, colour = continent)) +
geom_point(shape = 1) +
facet_wrap(~continent) +
theme_bw()theme_set(theme_bw())
library(tidyverse)
theme_set(theme_bw())gapdata2007 %>%
ggplot(aes(x = gdpPercap/1000, y = lifeExp, size = pop)) +
geom_point()gapdata2007 %>%
ggplot(aes(x = gdpPercap/1000, y = lifeExp, size = pop)) +
geom_point(shape = 1, alpha = 0.5)gapdata %>%
filter(country == "United Kingdom") %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line()gapdata %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line()gapdata %>%
ggplot(aes(x = year, y = lifeExp, group = country)) +
geom_line()Look at Figure 4.9 on page https://argoshare.is.ed.ac.uk/healthyr_book/line-plotstime-series-plots.html
gapdata %>%
ggplot(aes(x = year, y = lifeExp, group = country)) +
geom_line()Try to transform the above graph following these instructions:
gapdata2007 %>%
filter(country %in% c("United Kingdom", "France", "Germany")) %>%
ggplot(aes(x = country, y = lifeExp)) +
geom_col() gapdata2007 %>%
count(continent)## # A tibble: 5 × 2
## continent n
## <fct> <int>
## 1 Africa 52
## 2 Americas 25
## 3 Asia 33
## 4 Europe 30
## 5 Oceania 2
gapdata2007 %>%
ggplot(aes(x = continent)) +
geom_bar()gapdata2007 %>%
ggplot(aes(x = continent, colour = country)) +
geom_bar(fill = NA) +
theme(legend.position = "none")gapdata2007 %>%
ggplot(aes(x = "Global", fill = continent)) +
geom_bar()Look at Figure 4.13 on the page https://argoshare.is.ed.ac.uk/healthyr_book/bar-plots.html and try to recreate it!
gapdata2007 %>%
ggplot(aes(x = lifeExp)) +
geom_histogram(binwidth = 10)gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()# (1)
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot() +
geom_point()# (2)
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot() +
geom_jitter()4.9 continued…
# (3)
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp, colour = continent)) +
geom_boxplot() +
geom_jitter()# (4)
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot() +
geom_jitter(aes(colour = continent))label_data <- gapdata2007 %>%
group_by(continent) %>%
filter(lifeExp == max(lifeExp)) %>%
select(country, continent, lifeExp)
# since we filtered for lifeExp == max(lifeExp)
# these are the maximum life expectancy countries at each continent:
label_data## # A tibble: 5 × 3
## # Groups: continent [5]
## country continent lifeExp
## <fct> <fct> <dbl>
## 1 Australia Oceania 81.2
## 2 Canada Americas 80.7
## 3 Iceland Europe 81.8
## 4 Japan Asia 82.6
## 5 Reunion Africa 76.4
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
# First geom - boxplot
geom_boxplot() +
# Second geom - jitter with its own aes(colour = )
geom_jitter(aes(colour = continent)) +
# Third geom - label, with its own dataset (label_data) and aes(label = )
geom_label(data = label_data, aes(label = country))Try the suggested experiments given on page https://argoshare.is.ed.ac.uk/healthyr_book/multiple-geoms-multiple-aes.html with the R code above. (Copy-paste the above code chunk below to experiment with it!)
You will find solutions to Exercises 4.5.1 and 4.6.5 on the page https://argoshare.is.ed.ac.uk/healthyr_book/solutions.html
(this is all extra material, if you are curious and have some time to check it!)
gapdata %>%
filter(continent == "Europe") %>%
ggplot(aes(y = fct_reorder(country, lifeExp, .fun=max),
x = lifeExp,
colour = year)) +
geom_point(shape = 15, size = 2) +
scale_colour_distiller(palette = "Greens", direction = 1) +
theme_bw()another example:
gapdata2007 %>%
group_by(continent) %>%
mutate(country_number = seq_along(country)) %>%
ggplot(aes(x = continent)) +
geom_bar(aes(colour = continent), fill = NA, show.legend = FALSE) +
geom_text(aes(y = country_number, label = country), vjust = 1)+
geom_label(aes(label = continent), y = -1) +
theme_void()Great! That was it for Chapter 4.
The next chapter will take us further in fine tuning plots. It can be well considered EXTRA MATERIAL, so if you are curious, just go on and try it, too!
You may also skip Chapter 5 and proceed straight to Chapter 6.
You may come back in Chapter 5 anytime later, to learn the fine-tuning tricks!
https://argoshare.is.ed.ac.uk/healthyr_book/finetuning.html
library(gapminder)
library(tidyverse)
p0 <- gapminder %>%
filter(year == 2007) %>%
ggplot(aes(y = lifeExp, x = gdpPercap, colour = continent)) +
geom_point(alpha = 0.3) +
theme_bw() +
geom_smooth(method = "lm", se = FALSE) +
scale_colour_brewer(palette = "Set1")
p0## `geom_smooth()` using formula = 'y ~ x'
p1 <- p0 + scale_x_log10()
p1## `geom_smooth()` using formula = 'y ~ x'
p2 <- p0 + expand_limits(y = 0)
p2## `geom_smooth()` using formula = 'y ~ x'
p3 <- p0 + expand_limits(y = c(0, 100))
p3## `geom_smooth()` using formula = 'y ~ x'
p4 <- p0 +
expand_limits(y = c(0, 100)) +
coord_cartesian(expand = FALSE)
p4## `geom_smooth()` using formula = 'y ~ x'
# you may install the package also by selecting the command after '#' and activating it:
library(patchwork) # install.packages("patchwork")
p1 + p2 + p3 + p4 + plot_annotation(tag_levels = "1", tag_prefix = "p")## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
p5 <- p0 +
coord_cartesian(ylim = c(70, 85), xlim = c(20000, 40000))
p5## `geom_smooth()` using formula = 'y ~ x'
p6 <- p0 +
scale_y_continuous(limits = c(70, 85)) +
scale_x_continuous(limits = c(20000, 40000))
# Compare:
p5 + labs(tag = "p5") + p6 + labs(tag = "p6")## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 114 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 114 rows containing missing values (`geom_point()`).
# calculating the maximum value to be included in the axis breaks:
max_value = gapminder %>%
filter(year == 2007) %>%
summarise(max_lifeExp = max(lifeExp)) %>%
pull(max_lifeExp) %>%
round(1)
# using scale_y_continuous(breaks = ...):
p7 <- p0 +
coord_cartesian(ylim = c(0, 100), expand = 0) +
scale_y_continuous(breaks = c(18, 50, max_value))
# we may also include custom labels for our breaks:
p8 <- p0 +
coord_cartesian(ylim = c(0, 100), expand = 0) +
scale_y_continuous(breaks = c(18, 50, max_value), labels = c("Adults", "50", "MAX"))
p7 + labs(tag = "p7") + p8 + labs(tag = "p8")## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
p9 <- p0 +
scale_color_brewer(palette = "Paired")## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
p10 <- p0 +
scale_color_brewer("Continent - \n one of 5", palette = "Paired")## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
p9 + labs(tag = "p9") + p10 + labs(tag = "p10")## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
p11 <- p0 +
scale_color_manual(values = c("red", "green", "blue", "purple", "pink"))## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
p12 <- p0 +
scale_color_manual(values = c("#8dd3c7", "#ffffb3", "#bebada",
"#fb8072", "#80b1d3"))## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
p11 + labs(tag = "p11") + p12 + labs(tag = "p12")## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
p13 <- p0 +
labs(x = "Gross domestic product per capita",
y = "Life expectancy",
title = "Health and economics",
subtitle = "Gapminder dataset, 2007",
caption = Sys.Date(),
tag = "p13")
p13## `geom_smooth()` using formula = 'y ~ x'
p14 <- p0 +
annotate("text",
x = 25000,
y = 50,
label = "No points here!")
p15 <- p0 +
annotate("label",
x = 25000,
y = 50,
label = "No points here!")
p16 <- p0 +
annotate("label",
x = 25000,
y = 50,
label = "No points here!",
hjust = 0)
p14 + labs(tag = "p14") + (p15 + labs(tag = "p15"))/ (p16 + labs(tag = "p16"))## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# a value we made up for this example
# a real analysis would get it from the linear model object
fit_glance <- tibble(r.squared = 0.7693465)
plot_rsquared <- paste0(
"R^2 == ",
fit_glance$r.squared %>% round(2))
p17 <- p0 +
annotate("text",
x = 25000,
y = 50,
label = plot_rsquared, parse = TRUE,
hjust = 0)
p17 + labs(tag = "p17")## `geom_smooth()` using formula = 'y ~ x'
p18 <- p0 +
theme(axis.text.y = element_text(colour = "green", size = 14),
axis.text.x = element_text(colour = "red", angle = 45, vjust = 0.5),
axis.title = element_text(colour = "blue", size = 16)
)
p18 + labs(tag = "p18")## `geom_smooth()` using formula = 'y ~ x'
p19 <- p0 +
theme(legend.position = "none")
p20 <- p0 +
theme(legend.position = c(1,0), #bottom-right corner
legend.justification = c(1,0))
p19 + labs(tag = "p19") + p20 + labs(tag = "p20")## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
5.5.2 continued…
p21 <- p0 +
guides(colour = guide_legend(ncol = 2)) +
theme(legend.position = "top") # moving to the top optional
p21 + labs(tag = "p21")## `geom_smooth()` using formula = 'y ~ x'
ggsave(p0, file = "my_saved_plot.pdf", width = 5, height = 4)## `geom_smooth()` using formula = 'y ~ x'
ggsave(p0, file = "my_saved_plot_larger.pdf", width = 10, height = 8)## `geom_smooth()` using formula = 'y ~ x'
Congrats - that was all for Chapter 5!
Chapters 6 and 8 prepare for the corresponding analysis chapters (7 and 9) in the RHDS book.
Both chapters (6 and 8) include useful things, for example, for plotting and data wrangling with continuous (Ch.6) and categorical (Ch.8) variables. Hence you should practice them through, too.
https://argoshare.is.ed.ac.uk/healthyr_book/chap06-h1.html
# Load packages (and install the finalfit package if not yet installed)
library(tidyverse)
library(finalfit) # install.packages("finalfit")
library(gapminder)
# Create object gapdata from object gapminder
gapdata <- gapminder
glimpse(gapdata) # each variable as line, variable type, first values## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
missing_glimpse(gapdata) # missing data for each variable## label var_type n missing_n missing_percent
## country country <fct> 1704 0 0.0
## continent continent <fct> 1704 0 0.0
## year year <int> 1704 0 0.0
## lifeExp lifeExp <dbl> 1704 0 0.0
## pop pop <int> 1704 0 0.0
## gdpPercap gdpPercap <dbl> 1704 0 0.0
ff_glimpse(gapdata) # summary statistics for each variable## $Continuous
## label var_type n missing_n missing_percent mean
## year year <int> 1704 0 0.0 1979.5
## lifeExp lifeExp <dbl> 1704 0 0.0 59.5
## pop pop <int> 1704 0 0.0 29601212.3
## gdpPercap gdpPercap <dbl> 1704 0 0.0 7215.3
## sd min quartile_25 median quartile_75 max
## year 17.3 1952.0 1965.8 1979.5 1993.2 2007.0
## lifeExp 12.9 23.6 48.2 60.7 70.8 82.6
## pop 106157896.7 60011.0 2793664.0 7023595.5 19585221.8 1318683096.0
## gdpPercap 9857.5 241.2 1202.1 3531.8 9325.5 113523.1
##
## $Categorical
## label var_type n missing_n missing_percent levels_n
## country country <fct> 1704 0 0.0 142
## continent continent <fct> 1704 0 0.0 5
## levels
## country -
## continent "Africa", "Americas", "Asia", "Europe", "Oceania"
## levels_count levels_percent
## country - -
## continent 624, 300, 396, 360, 24 36.6, 17.6, 23.2, 21.1, 1.4
gapdata %>%
filter(year %in% c(2002, 2007)) %>%
ggplot(aes(x = lifeExp)) + # remember aes()
geom_histogram(bins = 20) + # histogram with 20 bars
facet_grid(year ~ continent) # optional: add scales="free" gapdata %>%
filter(year %in% c(2002, 2007)) %>%
ggplot(aes(sample = lifeExp)) + # Q-Q plot requires 'sample'
geom_qq() + # defaults to normal distribution
geom_qq_line(colour = "blue") + # add the theoretical line
facet_grid(year ~ continent)gapdata %>%
filter(year %in% c(2002, 2007)) %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot() +
facet_wrap(~ year)6.4.3 continued…
gapdata %>%
filter(year %in% c(2002, 2007)) %>%
ggplot(aes(x = factor(year), y = lifeExp)) +
geom_boxplot(aes(fill = continent)) + # add colour to boxplots
geom_jitter(alpha = 0.4) + # alpha = transparency
facet_wrap(~ continent, ncol = 5) + # spread by continent
theme(legend.position = "none") + # remove legend
xlab("Year") + # label x-axis
ylab("Life expectancy (years)") + # label y-axis
ggtitle(
"Life expectancy by continent in 2002 v 2007") # add titlettest_data <- gapdata %>% # save as object ttest_data
filter(year == 2007) %>% # 2007 only
filter(continent %in% c("Asia", "Europe")) # Asia/Europe only
ttest_result <- ttest_data %>% # example using pipe
t.test(lifeExp ~ continent, data = .) # note data = ., see below
ttest_result##
## Welch Two Sample t-test
##
## data: lifeExp by continent
## t = -4.6468, df = 41.529, p-value = 3.389e-05
## alternative hypothesis: true difference in means between group Asia and group Europe is not equal to 0
## 95 percent confidence interval:
## -9.926525 -3.913705
## sample estimates:
## mean in group Asia mean in group Europe
## 70.72848 77.64860
ttest_result$p.value # Extracted element of result object## [1] 3.38922e-05
ttest_result$conf.int # Extracted element of result object## [1] -9.926525 -3.913705
## attr(,"conf.level")
## [1] 0.95
paired_data <- gapdata %>% # save as object paired_data
filter(year %in% c(2002, 2007)) %>% # 2002 and 2007 only
filter(continent == "Asia") # Asia only
paired_data %>%
ggplot(aes(x = year, y = lifeExp,
group = country)) + # for individual country lines
geom_line()6.5.3 continued…
paired_table <- paired_data %>% # save object paired_data
select(country, year, lifeExp) %>% # select vars interest
pivot_wider(names_from = year, # put years in columns
values_from = lifeExp) %>%
mutate(
dlifeExp = `2007` - `2002` # difference in means
)
paired_table## # A tibble: 33 × 4
## country `2002` `2007` dlifeExp
## <fct> <dbl> <dbl> <dbl>
## 1 Afghanistan 42.1 43.8 1.70
## 2 Bahrain 74.8 75.6 0.840
## 3 Bangladesh 62.0 64.1 2.05
## 4 Cambodia 56.8 59.7 2.97
## 5 China 72.0 73.0 0.933
## 6 Hong Kong, China 81.5 82.2 0.713
## 7 India 62.9 64.7 1.82
## 8 Indonesia 68.6 70.6 2.06
## 9 Iran 69.5 71.0 1.51
## 10 Iraq 57.0 59.5 2.50
## # … with 23 more rows
# Mean of difference in years
paired_table %>% summarise( mean(dlifeExp) )## # A tibble: 1 × 1
## `mean(dlifeExp)`
## <dbl>
## 1 1.49
paired_data %>%
t.test(lifeExp ~ year, data = ., paired = TRUE)##
## Paired t-test
##
## data: lifeExp by year
## t = -14.338, df = 32, p-value = 1.758e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.706941 -1.282271
## sample estimates:
## mean of the differences
## -1.494606
# the tidy() function comes from the package broom - install it first!
library(broom) # install.packages("broom")
gapdata %>%
filter(year == 2007) %>% # 2007 only
group_by(continent) %>% # split by continent
do( # dplyr function
t.test(.$lifeExp, mu = 77) %>% # compare mean to 77 years
tidy() # tidy into tibble
)## # A tibble: 5 × 9
## # Groups: continent [5]
## continent estimate statistic p.value parameter conf.…¹ conf.…² method alter…³
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Africa 54.8 -16.6 3.15e-22 51 52.1 57.5 One S… two.si…
## 2 Americas 73.6 -3.82 8.32e- 4 24 71.8 75.4 One S… two.si…
## 3 Asia 70.7 -4.52 7.88e- 5 32 67.9 73.6 One S… two.si…
## 4 Europe 77.6 1.19 2.43e- 1 29 76.5 78.8 One S… two.si…
## 5 Oceania 80.7 7.22 8.77e- 2 1 74.2 87.3 One S… two.si…
## # … with abbreviated variable names ¹conf.low, ²conf.high, ³alternative
# note that we're using dlifeExp
# so the differences we calculated above
t.test(paired_table$dlifeExp, mu = 0)##
## One Sample t-test
##
## data: paired_table$dlifeExp
## t = 14.338, df = 32, p-value = 1.758e-15
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1.282271 1.706941
## sample estimates:
## mean of x
## 1.494606
gapdata %>%
filter(year == 2007) %>%
filter(continent %in%
c("Americas", "Europe", "Asia")) %>%
ggplot(aes(x = continent, y=lifeExp)) +
geom_boxplot()aov_data <- gapdata %>%
filter(year == 2007) %>%
filter(continent %in% c("Americas", "Europe", "Asia"))
fit = aov(lifeExp ~ continent, data = aov_data)
summary(fit)## Df Sum Sq Mean Sq F value Pr(>F)
## continent 2 755.6 377.8 11.63 3.42e-05 ***
## Residuals 85 2760.3 32.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
6.7.2 continued…
library(broom) # install.packages("broom")
gapdata %>%
filter(year == 2007) %>%
filter(continent %in% c("Americas", "Europe", "Asia")) %>%
aov(lifeExp~continent, data = .) %>%
tidy()## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 continent 2 756. 378. 11.6 0.0000342
## 2 Residuals 85 2760. 32.5 NA NA
library(ggfortify) # install.packages("ggfortify")
autoplot(fit)pairwise.t.test(aov_data$lifeExp, aov_data$continent,
p.adjust.method = "bonferroni")##
## Pairwise comparisons using t tests with pooled SD
##
## data: aov_data$lifeExp and aov_data$continent
##
## Americas Asia
## Asia 0.180 -
## Europe 0.031 1.9e-05
##
## P value adjustment method: bonferroni
pairwise.t.test(aov_data$lifeExp, aov_data$continent,
p.adjust.method = "fdr")##
## Pairwise comparisons using t tests with pooled SD
##
## data: aov_data$lifeExp and aov_data$continent
##
## Americas Asia
## Asia 0.060 -
## Europe 0.016 1.9e-05
##
## P value adjustment method: fdr
africa2002 <- gapdata %>% # save as africa2002
filter(year == 2002) %>% # only 2002
filter(continent == "Africa") %>% # only Africa
select(country, lifeExp) %>% # only these variables
mutate(
lifeExp_log = log(lifeExp) # log life expectancy
)
head(africa2002) # inspect## # A tibble: 6 × 3
## country lifeExp lifeExp_log
## <fct> <dbl> <dbl>
## 1 Algeria 71.0 4.26
## 2 Angola 41.0 3.71
## 3 Benin 54.4 4.00
## 4 Botswana 46.6 3.84
## 5 Burkina Faso 50.6 3.92
## 6 Burundi 47.4 3.86
africa2002 %>%
# pivot lifeExp and lifeExp_log values to same column (for easy plotting):
pivot_longer(contains("lifeExp")) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 15) + # make histogram
facet_wrap(~name, scales = "free") # facet with axes free to varyafrica_data <- gapdata %>%
filter(year %in% c(1982, 2007)) %>% # only 1982 and 2007
filter(continent %in% c("Africa")) # only Africa
p1 <- africa_data %>% # save plot as p1
ggplot(aes(x = lifeExp)) +
geom_histogram(bins = 15) +
facet_wrap(~year)
p2 <- africa_data %>% # save plot as p2
ggplot(aes(sample = lifeExp)) + # `sample` for Q-Q plot
geom_qq() +
geom_qq_line(colour = "blue") +
facet_wrap(~year)
p3 <- africa_data %>% # save plot as p3
ggplot(aes(x = factor(year), # try without factor(year) to
y = lifeExp)) + # see the difference
geom_boxplot(aes(fill = factor(year))) + # colour boxplot
geom_jitter(alpha = 0.4) + # add data points
theme(legend.position = "none") # remove legend
library(patchwork) # install.packages("patchwork") # great for combining plots
p1 / p2 | p3africa_data %>%
wilcox.test(lifeExp ~ year, data = .)##
## Wilcoxon rank sum test with continuity correction
##
## data: lifeExp by year
## W = 1130, p-value = 0.1499
## alternative hypothesis: true location shift is not equal to 0
library(broom)
gapdata %>%
filter(year == 2007) %>%
filter(continent %in% c("Americas", "Europe", "Asia")) %>%
kruskal.test(lifeExp~continent, data = .) %>%
tidy()## # A tibble: 1 × 4
## statistic p.value parameter method
## <dbl> <dbl> <int> <chr>
## 1 21.6 0.0000202 2 Kruskal-Wallis rank sum test
dependent <- "year"
explanatory <- c("lifeExp", "pop", "gdpPercap")
africa_data %>%
mutate(
year = factor(year)
) %>%
summary_factorlist(dependent, explanatory,
cont = "median", p = TRUE)## label levels 1982
## lifeExp Median (IQR) 50.8 (45.6 to 56.6)
## pop Median (IQR) 5668228.5 (1569553.8 to 9788207.8)
## gdpPercap Median (IQR) 1323.7 (828.7 to 2787.6)
## 2007 p
## 52.9 (47.8 to 59.4) 0.149
## 10093310.5 (2909226.5 to 19363654.5) 0.033
## 1452.3 (863.0 to 3993.5) 0.503
6.10 continued…
dependent <- "year"
explanatory <- c("lifeExp", "pop", "gdpPercap")
africa_data %>%
mutate(
year = factor(year)
) %>%
summary_factorlist(dependent, explanatory,
cont_nonpara = c(1, 3), # variable 1&3 are non-parametric
cont_range = TRUE, # lower and upper quartile
p = TRUE, # include hypothesis test
p_cont_para = "t.test", # use t.test/aov for parametric
add_row_totals = TRUE, # row totals
include_row_missing_col = FALSE, # missing values row totals
add_dependent_label = TRUE) # dependent label ## Dependent: year Total N 1982
## lifeExp 104 (100.0) Median (IQR) 50.8 (45.6 to 56.6)
## pop 104 (100.0) Mean (SD) 9602857.4 (13456243.4)
## gdpPercap 104 (100.0) Median (IQR) 1323.7 (828.7 to 2787.6)
## 2007 p
## 52.9 (47.8 to 59.4) 0.149
## 17875763.3 (24917726.2) 0.038
## 1452.3 (863.0 to 3993.5) 0.503
See page https://argoshare.is.ed.ac.uk/healthyr_book/exercises-1.html
You may try yourself, and then check the solutions (link below)!
Solutions to the above exercises!
https://argoshare.is.ed.ac.uk/healthyr_book/solutions-1.html
Great job! Chapter 6 DONE. Next: Chapter 8!
https://argoshare.is.ed.ac.uk/healthyr_book/chap08-h1.html
# Get the data from the boot package (that includes tools for bootstrapping methods):
meldata <- boot::melanoma # Survival from Malignant Melanomalibrary(tidyverse)
library(finalfit)
theme_set(theme_bw())
meldata %>% glimpse()## Rows: 205
## Columns: 7
## $ time <dbl> 10, 30, 35, 99, 185, 204, 210, 232, 232, 279, 295, 355, 386,…
## $ status <dbl> 3, 3, 2, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 1, …
## $ sex <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, …
## $ age <dbl> 76, 56, 41, 71, 52, 28, 77, 60, 49, 68, 53, 64, 68, 63, 14, …
## $ year <dbl> 1972, 1968, 1977, 1968, 1965, 1971, 1972, 1974, 1968, 1971, …
## $ thickness <dbl> 6.76, 0.65, 1.34, 2.90, 12.08, 4.84, 5.16, 3.22, 12.88, 7.41…
## $ ulcer <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
meldata %>% ff_glimpse()## $Continuous
## label var_type n missing_n missing_percent mean sd min
## time time <dbl> 205 0 0.0 2152.8 1122.1 10.0
## status status <dbl> 205 0 0.0 1.8 0.6 1.0
## sex sex <dbl> 205 0 0.0 0.4 0.5 0.0
## age age <dbl> 205 0 0.0 52.5 16.7 4.0
## year year <dbl> 205 0 0.0 1969.9 2.6 1962.0
## thickness thickness <dbl> 205 0 0.0 2.9 3.0 0.1
## ulcer ulcer <dbl> 205 0 0.0 0.4 0.5 0.0
## quartile_25 median quartile_75 max
## time 1525.0 2005.0 3042.0 5565.0
## status 1.0 2.0 2.0 3.0
## sex 0.0 0.0 1.0 1.0
## age 42.0 54.0 65.0 95.0
## year 1968.0 1970.0 1972.0 1977.0
## thickness 1.0 1.9 3.6 17.4
## ulcer 0.0 0.0 1.0 1.0
##
## $Categorical
## data frame with 0 columns and 205 rows
meldata <- meldata %>%
mutate(sex.factor = # Make new variable
sex %>% # from existing variable
factor() %>% # convert to factor
fct_recode( # forcats function
"Female" = "0", # new on left, old on right
"Male" = "1") %>%
ff_label("Sex"), # Optional label for finalfit
# same thing but more condensed code:
ulcer.factor = factor(ulcer) %>%
fct_recode("Present" = "1",
"Absent" = "0") %>%
ff_label("Ulcerated tumour"),
status.factor = factor(status) %>%
fct_recode("Died melanoma" = "1",
"Alive" = "2",
"Died - other causes" = "3") %>%
ff_label("Status"))
View(meldata) # take a look at the data!# Summary of age
meldata$age %>%
summary()## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 42.00 54.00 52.46 65.00 95.00
meldata %>%
ggplot(aes(x = age)) +
geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
meldata <- meldata %>%
mutate(
age.factor =
age %>%
cut(4)
)
meldata$age.factor %>%
summary()## (3.91,26.8] (26.8,49.5] (49.5,72.2] (72.2,95.1]
## 16 68 102 19
8.6.1 continued…
meldata <- meldata %>%
mutate(
age.factor =
age %>%
Hmisc::cut2(g=4) # Note, cut2 comes from the Hmisc package
)
meldata$age.factor %>%
summary()## [ 4,43) [43,55) [55,66) [66,95]
## 55 49 53 48
View(meldata) # take a look at the data!8.6.1 continued…
meldata <- meldata %>%
mutate(
age.factor =
age %>%
cut(breaks = c(4,20,40,60,95), include.lowest = TRUE) %>%
fct_recode(
"≤20" = "[4,20]",
"21 to 40" = "(20,40]",
"41 to 60" = "(40,60]",
">60" = "(60,95]"
) %>%
ff_label("Age (years)")
)
head(meldata$age.factor)## [1] >60 41 to 60 41 to 60 >60 41 to 60 21 to 40
## Levels: ≤20 21 to 40 41 to 60 >60
View(meldata) # take a look at the data!p1 <- meldata %>%
ggplot(aes(x = ulcer.factor, fill = status.factor)) +
geom_bar() +
theme(legend.position = "none")
p2 <- meldata %>%
ggplot(aes(x = ulcer.factor, fill = status.factor)) +
geom_bar(position = "fill") +
ylab("proportion")
library(patchwork)
p1 + p28.7 continued…
p1 <- meldata %>%
ggplot(aes(x = ulcer.factor, fill = status.factor)) +
geom_bar(position = position_stack(reverse = TRUE)) +
theme(legend.position = "none")
p2 <- meldata %>%
ggplot(aes(x = ulcer.factor, fill = status.factor)) +
geom_bar(position = position_fill(reverse = TRUE)) +
ylab("proportion")
library(patchwork)
p1 + p28.7 continued…
p1 <- meldata %>%
ggplot(aes(x = ulcer.factor, fill=status.factor)) +
geom_bar(position = position_stack(reverse = TRUE)) +
facet_grid(sex.factor ~ age.factor) +
theme(legend.position = "none")
p2 <- meldata %>%
ggplot(aes(x = ulcer.factor, fill=status.factor)) +
geom_bar(position = position_fill(reverse = TRUE)) +
facet_grid(sex.factor ~ age.factor)+
theme(legend.position = "bottom") +
ylab("proportion") # this line was missing in the book
p1 / p2meldata <- meldata %>%
mutate(
status_dss = fct_collapse( # dss - disease specific survival
status.factor,
"Alive" = c("Alive", "Died - other causes"))
)
View(meldata) # take a look at the data!# dss - disease specific survival
meldata$status_dss %>% levels()## [1] "Died melanoma" "Alive"
meldata <- meldata %>%
mutate(status_dss = status_dss %>%
fct_relevel("Alive")
)
meldata$status_dss %>% levels()## [1] "Alive" "Died melanoma"
library(finalfit)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory = "ulcer.factor")## label levels Alive Died melanoma
## Ulcerated tumour Absent 99 (66.9) 16 (28.1)
## Present 49 (33.1) 41 (71.9)
8.10 continued…
library(finalfit)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness")
)## label levels Alive Died melanoma
## Ulcerated tumour Absent 99 (66.9) 16 (28.1)
## Present 49 (33.1) 41 (71.9)
## Age (years) ≤20 6 (4.1) 3 (5.3)
## 21 to 40 30 (20.3) 7 (12.3)
## 41 to 60 66 (44.6) 26 (45.6)
## >60 46 (31.1) 21 (36.8)
## Sex Female 98 (66.2) 28 (49.1)
## Male 50 (33.8) 29 (50.9)
## thickness Mean (SD) 2.4 (2.5) 4.3 (3.6)
table(meldata$ulcer.factor, meldata$status_dss) ##
## Alive Died melanoma
## Absent 99 16
## Present 49 41
# both give same result
with(meldata, table(ulcer.factor, status_dss))## status_dss
## ulcer.factor Alive Died melanoma
## Absent 99 16
## Present 49 41
8.11.1 continued…
library(magrittr)##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
meldata %$% # note $ sign here
table(ulcer.factor, status_dss)## status_dss
## ulcer.factor Alive Died melanoma
## Absent 99 16
## Present 49 41
meldata %$%
table(ulcer.factor, status_dss) %>%
prop.table(margin = 1) # 1: row, 2: column etc.## status_dss
## ulcer.factor Alive Died melanoma
## Absent 0.8608696 0.1391304
## Present 0.5444444 0.4555556
meldata %$% # note $ sign here
table(ulcer.factor, status_dss) %>%
chisq.test()##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 23.631, df = 1, p-value = 1.167e-06
library(broom)
meldata %$% # note $ sign here
table(ulcer.factor, status_dss) %>%
chisq.test() %>%
tidy()## # A tibble: 1 × 4
## statistic p.value parameter method
## <dbl> <dbl> <int> <chr>
## 1 23.6 0.00000117 1 Pearson's Chi-squared test with Yates' continu…
meldata %$% # note $ sign here
table(age.factor, status_dss) %>%
chisq.test()## Warning in chisq.test(.): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 2.0198, df = 3, p-value = 0.5683
meldata %$% # note $ sign here
table(age.factor, status_dss) %>%
fisher.test()##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.5437
## alternative hypothesis: two.sided
library(finalfit)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory = "ulcer.factor",
p = TRUE)## label levels Alive Died melanoma p
## Ulcerated tumour Absent 99 (66.9) 16 (28.1) <0.001
## Present 49 (33.1) 41 (71.9)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness"),
p = TRUE)## Warning in chisq.test(age.factor, status_dss): Chi-squared approximation may be
## incorrect
## label levels Alive Died melanoma p
## Ulcerated tumour Absent 99 (66.9) 16 (28.1) <0.001
## Present 49 (33.1) 41 (71.9)
## Age (years) ≤20 6 (4.1) 3 (5.3) 0.568
## 21 to 40 30 (20.3) 7 (12.3)
## 41 to 60 66 (44.6) 26 (45.6)
## >60 46 (31.1) 21 (36.8)
## Sex Female 98 (66.2) 28 (49.1) 0.036
## Male 50 (33.8) 29 (50.9)
## thickness Mean (SD) 2.4 (2.5) 4.3 (3.6) <0.001
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness"),
p = TRUE,
p_cat = "fisher")## label levels Alive Died melanoma p
## Ulcerated tumour Absent 99 (66.9) 16 (28.1) <0.001
## Present 49 (33.1) 41 (71.9)
## Age (years) ≤20 6 (4.1) 3 (5.3) 0.544
## 21 to 40 30 (20.3) 7 (12.3)
## 41 to 60 66 (44.6) 26 (45.6)
## >60 46 (31.1) 21 (36.8)
## Sex Female 98 (66.2) 28 (49.1) 0.026
## Male 50 (33.8) 29 (50.9)
## thickness Mean (SD) 2.4 (2.5) 4.3 (3.6) <0.001
(The code chunk in 8.13, below the text “Further options can be included” does not work…)
meldata %>%
count(ulcer.factor, status.factor) %>%
group_by(status.factor) %>%
mutate(total = sum(n)) %>%
mutate(percentage = round(100*n/total, 1)) %>%
mutate(count_perc = paste0(n, " (", percentage, ")")) %>%
select(-total, -n, -percentage) %>%
spread(status.factor, count_perc)## # A tibble: 2 × 4
## ulcer.factor `Died melanoma` Alive `Died - other causes`
## <fct> <chr> <chr> <chr>
## 1 Absent 16 (28.1) 92 (68.7) 7 (50)
## 2 Present 41 (71.9) 42 (31.3) 7 (50)
(Solutions to these exercises are not included in the book.)
Gooood job! Chapter 8 DONE.